
    Causal Representation Learning Made Identifiable by Grouping of Observational Variables

    A topic of great current interest is Causal Representation Learning (CRL), whose goal is to learn a causal model for hidden features in a data-driven manner. Unfortunately, CRL is severely ill-posed, since it combines two notoriously ill-posed problems: representation learning and causal discovery. Yet finding practical identifiability conditions that guarantee a unique solution is crucial for its practical applicability. Most approaches so far have been based on assumptions on the latent causal mechanisms, such as temporal causality, or on the existence of supervision or interventions; these can be too restrictive in actual applications. Here, we show identifiability based on novel, weak constraints, which require no temporal structure, intervention, nor weak supervision. The approach is based on assuming that the observational mixing exhibits a suitable grouping of the observational variables. We also propose a novel self-supervised estimation framework consistent with the model, prove its statistical consistency, and experimentally show its superior CRL performance compared to state-of-the-art baselines. We further demonstrate its robustness against latent confounders and causal cycles.
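
    The abstract does not spell out the estimation framework, so the snippet below is only a hypothetical sketch of the general idea it suggests: each group of observed variables gets its own encoder, and a contrastive (self-supervised) loss discriminates matched group pairs from shuffled ones. The names (GroupEncoder, contrastive_loss) and the two-group setup are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch (not the paper's implementation): contrastive learning
# over grouped observations. Positive pairs take both groups from the same
# observation; negatives pair groups from different observations.
import torch
import torch.nn as nn

class GroupEncoder(nn.Module):
    """Small MLP encoding one group of observed variables into a latent vector."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

def contrastive_loss(z1, z2):
    """Logistic loss discriminating matched (same-sample) group pairs
    from mismatched pairs obtained by shuffling the second group."""
    pos = (z1 * z2).sum(dim=1)                                # matched pairs
    neg = (z1 * z2[torch.randperm(z2.size(0))]).sum(dim=1)    # shuffled pairs
    return (nn.functional.softplus(-pos) + nn.functional.softplus(neg)).mean()

# Toy usage with two observed groups of dimensions 10 and 12.
enc1, enc2 = GroupEncoder(10, 4), GroupEncoder(12, 4)
x1, x2 = torch.randn(256, 10), torch.randn(256, 12)
loss = contrastive_loss(enc1(x1), enc2(x2))
loss.backward()
```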

    A mixture of sparse coding models explaining properties of face neurons related to holistic and parts-based processing

    Experimental studies have revealed evidence of both parts-based and holistic representations of objects and faces in the primate visual system. However, it is still a mystery how such seemingly contradictory types of processing can coexist within a single system. Here, we propose a novel theory called mixture of sparse coding models, inspired by the formation of category-specific subregions in the inferotemporal (IT) cortex. We developed a hierarchical network that constructed a mixture of two sparse coding submodels on top of a simple Gabor analysis. The submodels were each trained with face or non-face object images, which resulted in separate representations of facial parts and object parts. Importantly, evoked neural activities were modeled by Bayesian inference, which had a top-down explaining-away effect that enabled recognition of an individual part to depend strongly on the category of the whole input. We show that this explaining-away effect was indeed crucial for the units in the face submodel to exhibit significant selectivity to face images over object images in a similar way to actual face-selective neurons in the macaque IT cortex. Furthermore, the model explained, qualitatively and quantitatively, several tuning properties to facial features found in the middle patch of face processing in IT as documented by Freiwald, Tsao, and Livingstone (2009). These included, in particular, tuning to only a small number of facial features that were often related to geometrically large parts like face outline and hair, preference and anti-preference of extreme facial features (e.g., very large/small inter-eye distance), and reduction of the gain of feature tuning for partial face stimuli compared to whole face stimuli. Thus, we hypothesize that the coding principle of facial features in the middle patch of face processing in the macaque IT cortex may be closely related to a mixture of sparse coding models.
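
    As a rough illustration of the mixture-of-sparse-coding idea (not the authors' actual network, which sits on top of a Gabor stage and uses full Bayesian inference with explaining-away), the sketch below infers sparse codes for an input under two submodels via ISTA and assigns the input to whichever submodel explains it with the lower penalized reconstruction cost. The dictionaries D_face and D_obj are assumed pretrained; here they are random placeholders.

```python
# Hypothetical sketch: MAP-style inference in a mixture of two sparse coding
# submodels. Each submodel has its own dictionary; the input is assigned to
# the submodel with the lower sparse-coding cost (a crude stand-in for the
# Bayesian explaining-away inference described in the abstract).
import numpy as np

def ista(x, D, lam=0.1, n_iter=200):
    """Solve min_a 0.5*||x - D a||^2 + lam*||a||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)
        a = a - grad / L
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft threshold
    return a

def mixture_inference(x, D_face, D_obj, lam=0.1):
    """Return per-submodel sparse codes and the index of the winning submodel."""
    codes, costs = [], []
    for D in (D_face, D_obj):
        a = ista(x, D, lam)
        codes.append(a)
        costs.append(0.5 * np.sum((x - D @ a) ** 2) + lam * np.sum(np.abs(a)))
    return codes, int(np.argmin(costs))

# Toy usage with random "pretrained" dictionaries (placeholders).
rng = np.random.default_rng(0)
D_face, D_obj = rng.standard_normal((64, 128)), rng.standard_normal((64, 128))
x = rng.standard_normal(64)
codes, winner = mixture_inference(x, D_face, D_obj)
```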

    Density Estimation in Infinite Dimensional Exponential Families

    In this paper, we consider an infinite dimensional exponential family, $\mathcal{P}$, of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space, $H$, and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. The main goal of the paper is to estimate an unknown density, $p_0$, through an element in $\mathcal{P}$. Standard techniques like maximum likelihood estimation (MLE) or pseudo MLE (based on the method of sieves), which are based on minimizing the KL divergence between $p_0$ and $\mathcal{P}$, do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. Instead, we propose an estimator, $\hat{p}_n$, based on minimizing the \emph{Fisher divergence}, $J(p_0\Vert p)$, between $p_0$ and $p\in\mathcal{P}$, which involves solving a simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that the proposed estimator is consistent, and provide a convergence rate of $n^{-\min\left\{\frac{2}{3},\frac{2\beta+1}{2\beta+2}\right\}}$ in Fisher divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^\beta)$ for some $\beta\ge 0$, where $C$ is a certain Hilbert-Schmidt operator on $H$ and $\mathcal{R}(C^\beta)$ denotes the image of $C^\beta$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$ and show that $J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as $n\rightarrow\infty$, and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the non-parametric kernel density estimator, and that the advantage with the proposed estimator grows as $d$ increases. Comment: 58 pages, 8 figures; fixed some errors and typos.
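
    For reference, the Fisher divergence minimized here, and the integration-by-parts identity (standard score matching) that removes the dependence on the unknown score of $p_0$ and on the log-partition function, can be written as follows. This is the generic score-matching form, not the paper's specific kernel expansion.

```latex
% Fisher divergence between the true density p_0 and a model density p:
J(p_0 \Vert p)
  = \tfrac{1}{2} \int_{\mathbb{R}^d} p_0(x)\,
    \bigl\lVert \nabla_x \log p_0(x) - \nabla_x \log p(x) \bigr\rVert_2^2 \, dx .

% Under mild regularity conditions, integration by parts gives the
% score-matching form, which no longer involves \nabla_x \log p_0:
J(p_0 \Vert p)
  = \int_{\mathbb{R}^d} p_0(x)
    \Bigl[ \tfrac{1}{2}\,\bigl\lVert \nabla_x \log p(x) \bigr\rVert_2^2
           + \Delta_x \log p(x) \Bigr] dx \;+\; \mathrm{const},

% where \Delta_x is the Laplacian and the constant does not depend on p.
% Since \nabla_x \log p does not involve the log-partition function, and the
% empirical version of this objective is quadratic in the natural parameter
% of the kernel exponential family, the estimator reduces to solving a
% (regularized) finite-dimensional linear system.
```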

    The Optimal Noise in Noise-Contrastive Learning Is Not What You Think

    Learning a parametric model of a data distribution is a well-known statistical problem that has seen renewed interest as it is brought to scale in deep learning. Framing the problem as a self-supervised task, where data samples are discriminated from noise samples, is at the core of state-of-the-art methods, beginning with Noise-Contrastive Estimation (NCE). Yet such contrastive learning requires a good noise distribution, which is hard to specify; domain-specific heuristics are therefore widely used. While a comprehensive theory is missing, it is widely assumed that the optimal noise should in practice be made equal to the data, both in distribution and proportion; this setting underlies Generative Adversarial Networks (GANs) in particular. Here, we empirically and theoretically challenge this assumption on the optimal noise. We show that deviating from this assumption can actually lead to better statistical estimators, in terms of asymptotic variance. In particular, the optimal noise distribution is different from the data's and even comes from a different family.
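
    For context, here is a minimal sketch of the NCE objective discussed in this abstract, for a toy unnormalized 1-D Gaussian model and a Gaussian noise distribution. The model, the noise choice, and the optimizer are illustrative assumptions; the paper's contribution concerns how the choice of this noise affects the estimator's asymptotic variance, which the sketch does not reproduce.

```python
# Minimal NCE sketch: discriminate data samples from noise samples with a
# logistic loss built from the log-ratio of an unnormalized model density and
# a known noise density. Model/noise choices here are toy assumptions.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x_data = rng.normal(loc=1.0, scale=1.5, size=2000)    # "unknown" data distribution
x_noise = rng.normal(loc=0.0, scale=3.0, size=2000)   # chosen noise distribution
nu = len(x_noise) / len(x_data)                       # noise-to-data ratio

def log_model(x, theta):
    """Unnormalized Gaussian log-density with a learned log-normalizer c."""
    mu, log_sigma, c = theta
    return -0.5 * ((x - mu) / np.exp(log_sigma)) ** 2 + c

def nce_loss(theta):
    # G(x) = log model(x) - log(nu * noise(x)), the classifier's logit.
    g_data = log_model(x_data, theta) - (np.log(nu) + norm.logpdf(x_data, 0.0, 3.0))
    g_noise = log_model(x_noise, theta) - (np.log(nu) + norm.logpdf(x_noise, 0.0, 3.0))
    # Logistic loss: data samples are positives, noise samples are negatives.
    return np.mean(np.logaddexp(0.0, -g_data)) + nu * np.mean(np.logaddexp(0.0, g_noise))

theta_hat = minimize(nce_loss, x0=np.array([0.0, 0.0, 0.0])).x
```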

    Sparse Linear Identifiable Multivariate Modeling

    In this paper we consider sparse and identifiable linear latent variable (factor) and linear Bayesian network models for parsimonious analysis of multivariate data. We propose a computationally efficient method for joint parameter and model inference, and model comparison. It consists of a fully Bayesian hierarchy for sparse models using slab and spike priors (two-component delta-function and continuous mixtures), non-Gaussian latent factors, and a stochastic search over the ordering of the variables. The framework, which we call SLIM (Sparse Linear Identifiable Multivariate modeling), is validated and benchmarked on artificial and real biological data sets. SLIM is closest in spirit to LiNGAM (Shimizu et al., 2006), but differs substantially in inference, Bayesian network structure learning and model comparison. Experimentally, SLIM performs equally well or better than LiNGAM with comparable computational complexity. We attribute this mainly to the stochastic search strategy used, and to parsimony (sparsity and identifiability), which is an explicit part of the model. We propose two extensions to the basic i.i.d. linear framework: non-linear dependence on observed variables, called SNIM (Sparse Non-linear Identifiable Multivariate modeling), and allowing for correlations between latent variables, called CSLIM (Correlated SLIM), for temporal and/or spatial data. The source code and scripts are available from http://cogsys.imm.dtu.dk/slim/. Comment: 45 pages, 17 figures.
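
    As a rough, hypothetical illustration of the kind of model SLIM builds on (a sparse linear Bayesian network with non-Gaussian disturbances over an ordering of the variables), the snippet below generates data from such a structural equation model. It covers only the data-generating side, not SLIM's Bayesian hierarchy, slab-and-spike priors, or stochastic ordering search.

```python
# Hypothetical sketch: sampling from a sparse linear Bayesian network with
# non-Gaussian (Laplace) disturbances, given a fixed causal ordering 0..d-1.
import numpy as np

def sample_sparse_linear_bn(n_samples, d, edge_prob=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # Strictly lower-triangular weight matrix => acyclic w.r.t. the ordering.
    B = np.tril(rng.normal(0.0, 1.0, size=(d, d)), k=-1)
    B *= rng.random((d, d)) < edge_prob              # sparsify (spike at zero)
    X = np.zeros((n_samples, d))
    for j in range(d):                               # generate in causal order
        noise = rng.laplace(0.0, 1.0, size=n_samples)  # non-Gaussian disturbance
        X[:, j] = X[:, :j] @ B[j, :j] + noise
    return X, B

X, B_true = sample_sparse_linear_bn(n_samples=5000, d=6)
```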